1-7
Fault Tolerance Protection and RAID Technology for Networks: A Primer

JEFF LEVENTHAL

According to a recent Computer Reseller News/Gallup poll, most networks are down for at least two hours per week, and the situation has not improved for most companies in the past three years. If an organization has 1,000 users per network, this equals one man-year per week of lost productivity. Even if a network is a fraction of that size, the number is imposing. For nearly a decade, many companies have responded by deploying expensive fault-tolerant servers and peripherals.
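The one-man-year figure follows directly from the poll's numbers. A short calculation makes the arithmetic explicit (the 2,000-hour work year is an assumed round figure, roughly 50 weeks of 40 hours):

```python
# Rough check of the downtime arithmetic: 1,000 users each losing
# 2 hours of productivity per week, measured against a ~2,000-hour
# work year (an assumed figure, not one stated in the poll).
users = 1000
downtime_hours_per_week = 2
work_hours_per_year = 2000

lost_hours_per_week = users * downtime_hours_per_week
person_years_per_week = lost_hours_per_week / work_hours_per_year
print(person_years_per_week)  # 1.0 -- one person-year of lost work per week
```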

Until the early 1990s, the fault-tolerant label was generally affixed to expensive and proprietary hardware systems for mainframes and minicomputers, where the losses associated with a system’s downtime were costly. The advent of client/server computing created a market for similar products for local area networks (LANs), because the cost of network downtime can be similarly devastating financially. Network downtime can be caused by anything from a bad network card or a failed communication gateway to a tape drive failure or loss of a tape used for backing up critical data. The chances that a LAN may fail increase as more software applications, hardware components, and users are added to the network.

This chapter describes products that offer fault tolerance at the system hardware level and those that use fault-tolerant methods to protect the integrity of data stored on network servers. The discussion concludes with a set of guidelines to help communications managers select the right type of fault-tolerant solution for their network. This chapter also discusses RAID (redundant array of independent [formerly “inexpensive”] disks) technology, which is used to coordinate multiple disk drives to protect against loss of data availability if one of the drives fails.

DEFINING FAULT TOLERANCE

PC Week columnist Peter Coffee noted the proliferation of fault tolerance in vendor advertising and compiled a list of seven factors that define fault tolerance. Coffee’s list included safety, reliability, confidentiality, integrity, availability, trustworthiness, and correctness. Two of the factors—integrity and availability—can be defined as follows:

  Availability is expressed as the percentage of uptime and is related to reliability (which Coffee defined as mean time between failures), because infinite time between failures would mean 100% availability. But when the inevitable occurs and a failure does happen, how long does it take to get service back to normal?
  Integrity refers to keeping data intact (as opposed to keeping data secret). Fault tolerance may mean rigorous logging of transactions, or the capacity to reverse any action so that data can always be returned to a known good state.
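The relationship between availability and reliability that Coffee describes can be expressed as a simple ratio of mean time between failures (MTBF) to total time, where repair time (MTTR) accounts for the downtime. The figures below are illustrative, not drawn from the chapter:

```python
# Availability as a fraction of uptime, derived from mean time
# between failures (MTBF) and mean time to repair (MTTR).
# As MTBF approaches infinity, availability approaches 100%.
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A hypothetical server that fails every 1,000 hours and takes
# 2 hours to restore:
print(f"{availability(1000, 2):.4%}")  # 99.8004%
```

The second question Coffee raises, how long it takes to get service back to normal, is exactly the MTTR term: halving repair time does as much for availability as doubling the time between failures.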

This chapter uses Coffee’s descriptions of availability and integrity to distinguish between products that offer fault tolerance at the system hardware level and those that use fault-tolerant methods to protect the data stored on the network servers.

Availability

The proliferation of hardware products with fault-tolerant features may be attributable to the ease with which a vendor can package two or more copies of a hardware component in a system. Network servers are an example of this phenomenon. Supercharged personal computers equipped with multiple power supplies, processors, and input/output (I/O) buses provide greater dependability in the event that one power supply, processor, or I/O controller fails. In this case, it is relatively easy to synchronize multiple copies of each component so that one mechanism takes over if its twin fails.

Cubix’s ERS/FT II

For example, Cubix’s ERS/FT II communications server has redundant, load-bearing, hot-swappable power supplies; multiple cooling fans; and failure alerts that notify the administrator audibly and through management software. The product’s Intelligent Environmental Sensor tracks fluctuations in voltage and temperature and transmits an alert if conditions exceed a safe operating range. A hung or failed system will not adversely affect any of the other processors in the system.

Vinca Corp.’s StandbyServer

Vinca Corp. has taken this supercharged PC/network server one step further by offering machines that duplicate any server on the network; if one crashes, an organization simply moves all its users to its twin. Vinca’s StandbyServer exemplifies this process, known as mirroring. However, mirroring has a significant drawback—if a software bug causes the primary server to crash, the same bug is likely to cause the secondary (mirrored) server also to crash. (Mirroring is an iteration of RAID technology, which is explained in greater detail later in this chapter.)
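The mirroring principle, and its shared-bug weakness, can be sketched in a few lines. This is a simplified model with drives represented as byte arrays, not Vinca's actual implementation; a real controller mirrors at the block-device level:

```python
# Minimal sketch of disk mirroring (RAID 1): every write is duplicated
# to both drives, so either copy can serve reads if the other fails.
class MirroredDisk:
    def __init__(self, size: int):
        self.primary = bytearray(size)
        self.mirror = bytearray(size)
        self.primary_failed = False

    def write(self, offset: int, data: bytes) -> None:
        # The duplicate write is the whole trick -- and also the flaw:
        # corrupt data written by a buggy application lands on both copies.
        self.primary[offset:offset + len(data)] = data
        self.mirror[offset:offset + len(data)] = data

    def read(self, offset: int, length: int) -> bytes:
        source = self.mirror if self.primary_failed else self.primary
        return bytes(source[offset:offset + length])

disk = MirroredDisk(64)
disk.write(0, b"payroll")
disk.primary_failed = True   # simulate a drive failure
print(disk.read(0, 7))       # b'payroll' -- served from the mirror
```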

Network Integrity, Inc.’s LANtegrity

An innovative twist on the mirrored server, without its bug-sensitivity drawback, is Network Integrity’s LANtegrity product, in which hard disks are not directly mirrored. Instead, there is a many-to-one relationship, similar to a RAID system, which has the advantage of lower hardware cost. LANtegrity handles backup by maintaining current and previous versions of all files in its Intelligent Data Vault. The vault keeps the most active files in disk storage and offloads the rest to a tape autoloader. LANtegrity polls the server every few minutes, copying any files that have changed, and any file can be retrieved as needed. If the primary server fails, the system can be running smoothly again in about 15 seconds without rebooting. Because the software is not fully replicated, a bug that crashed the first server should not affect the second.

NetFRAME Servers

The fault tolerance built into NetFRAME’s servers is attributable to its distributed, parallel software architecture. This architecture allows peripherals to be added and changed without shutting down the server, provides for dynamic isolation and correction of I/O problems (which are prime downtime culprits), distributes the processing load between the I/O server and the central processing unit (CPU), and prevents driver failures from bringing down the CPU.

Compaq’s SMART

Many of Compaq’s PCs feature its SMART (Self-Monitoring Analysis and Reporting Technology) client technology, although it is limited to client hard drives. If a SMART client believes that a crash may occur on a hard disk drive, it begins backing up the hard drive to the NetWare file server backup device. The downside is that the software cannot predict disk failures that give off no warning signals or failures caused by the computer itself.

Copyright (c) 1996-1999 EarthWeb, Inc. All rights reserved.